On U-processes and Clustering Performance
Stéphan Clémençon
LTCI UMR Telecom ParisTech/CNRS No. 5141 - Institut Telecom

Motivation

Pairwise dissimilarity-based clustering techniques are widely used to segment a dataset into groups such that data points in the same group are more similar to each other than to those in other groups. The empirical criteria these algorithms seek to optimize take the form of U-statistics of degree two. We propose to analyze their performance using recent advances in the theory of U-processes. The statistical framework considered makes it possible to establish learning rates for the excess of clustering risk and to design model selection tools as well. Optimal partitions are those that minimize the clustering risk W(P). Pairwise-based clustering can thus be cast as the minimization of a U-statistic over a class Π of partition candidates.
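To make the last point concrete, here is a minimal Python sketch of an empirical criterion of this kind. It rests on assumptions not stated in the text above: the criterion is taken to be the average, over all pairs of observations, of the dissimilarity between the two points when they fall in the same cell (and zero otherwise); the dissimilarity is the squared Euclidean distance; and the class Π is a small hand-written list of labellings, each with two cells. The function names are hypothetical.

```python
from itertools import combinations


def squared_euclidean(x, y):
    # Assumed dissimilarity measure; any symmetric D(x, y) could be used instead.
    return sum((a - b) ** 2 for a, b in zip(x, y))


def empirical_clustering_risk(points, labels, dissimilarity):
    """Degree-two U-statistic: average over all pairs i < j of
    D(x_i, x_j) * 1{x_i and x_j belong to the same cell}."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    total = 0.0
    for i, j in combinations(range(n), 2):
        if labels[i] == labels[j]:
            total += dissimilarity(points[i], points[j])
    # There are n(n-1)/2 unordered pairs, hence the normalisation below.
    return 2.0 * total / (n * (n - 1))


def best_partition(points, candidates, dissimilarity):
    """Return the candidate labelling with the smallest empirical risk."""
    return min(
        candidates,
        key=lambda labels: empirical_clustering_risk(points, labels, dissimilarity),
    )


# Toy data: two well-separated groups on the real line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]

# A hypothetical class of partition candidates, each encoded as cell labels.
candidates = [
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]

chosen = best_partition(points, candidates, squared_euclidean)
print(chosen)
```

The candidates here all use the same number of cells. Without some constraint of that sort, a partition into singletons would trivially drive this criterion to zero, which is one reason to restrict the minimization to a class Π.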